
engine: keep SimpleEngine serialized across cancellation #8

Merged

krystophny merged 1 commit into main from fix/simple-engine-cancel-serialization on Mar 24, 2026
Conversation

@krystophny
Collaborator

Summary

  • keep the SimpleEngine generation lock held until cancelled asyncio.to_thread(...) workers actually finish
  • cover the race with a focused regression test

Reproducer

FortBench hit this on Apple Silicon during local Qwen runs:

  • a non-streaming /v1/responses request was cancelled by the disconnect guard
  • the request task exited, but the blocking MLX worker thread was still running
  • the next request entered Metal before the first worker drained
  • MLX/Metal aborted with a failed assertion: `A command encoder is already encoding to this command buffer`
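The hazard is independent of MLX and easy to demonstrate: cancelling a task that awaits asyncio.to_thread(...) returns control immediately while the worker thread keeps running. A minimal, self-contained sketch (blocking() is a stand-in for the MLX generation call):

```python
import asyncio
import time


async def main():
    done = []

    def blocking():
        time.sleep(0.2)  # stand-in for the blocking MLX generation call
        done.append(True)

    task = asyncio.create_task(asyncio.to_thread(blocking))
    await asyncio.sleep(0.05)  # let the worker thread start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    print("thread done at cancellation?", bool(done))  # False: still running
    await asyncio.sleep(0.3)
    print("thread done eventually?", bool(done))       # True: outlived the task


asyncio.run(main())
```

Any lock released at that first CancelledError can therefore be re-acquired while the original thread is still inside Metal, which is exactly the window the second request hit.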

Fix

SimpleEngine now routes blocking MLX calls through a cancellation-safe helper that:

  • acquires _generation_lock
  • runs the blocking call in asyncio.to_thread(...)
  • on cancellation, waits for that worker task to finish before releasing the lock

This keeps MLX/Metal single-threaded even when the client disconnects.
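A condensed sketch of the helper's shape (not the exact code from this PR: the asyncio.shield() here is one way to keep the worker task alive through the drain, since a bare await on a cancelled-through task returns immediately, and the BaseException catch mirrors a follow-up commit referenced at the end of this thread):

```python
import asyncio


class SimpleEngine:
    def __init__(self) -> None:
        self._generation_lock = asyncio.Lock()

    async def _run_blocking_serialized(self, fn, *args):
        # Hold the lock for the full lifetime of the worker thread,
        # not just until the awaiting request is cancelled.
        async with self._generation_lock:
            task = asyncio.create_task(asyncio.to_thread(fn, *args))
            try:
                # shield() so cancelling this coroutine does not mark the
                # worker task cancelled while its thread is still running.
                return await asyncio.shield(task)
            except asyncio.CancelledError:
                # Drain: wait for the worker thread to finish before the
                # lock is released, so the next request cannot overlap
                # with it inside Metal.
                try:
                    await task
                except BaseException:
                    pass  # a later commit addresses logging drain failures
                raise
```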

Tests

  • python3 -m unittest tests.test_simple_engine_cancel_serialization -v
  • python3 -m compileall vllm_mlx tests

@krystophny krystophny merged commit 892e5bf into main Mar 24, 2026
@krystophny krystophny deleted the fix/simple-engine-cancel-serialization branch March 24, 2026 14:34
@qodo-code-review

Review Summary by Qodo

Keep SimpleEngine serialized across request cancellation

🐞 Bug fix


Walkthroughs

Description
• Introduce _run_blocking_serialized() helper to prevent MLX/Metal race conditions
• Ensure generation lock held until worker thread completes, even on cancellation
• Replace direct asyncio.to_thread() calls with cancellation-safe wrapper
• Add regression test verifying no concurrent MLX access after request cancellation
Diagram
flowchart LR
  A["Request cancelled"] --> B["_run_blocking_serialized acquires lock"]
  B --> C["asyncio.to_thread starts worker"]
  C --> D["Cancellation caught"]
  D --> E["Wait for worker to finish"]
  E --> F["Release lock safely"]
  F --> G["Next request enters MLX"]


File Changes

1. tests/test_simple_engine_cancel_serialization.py 🧪 Tests +80/-0

Regression test for cancellation-safe serialization

• New regression test file with SimpleEngineCancelSerializationTests class
• Tests that cancellation does not release generation lock before worker finishes
• Verifies concurrent MLX access counter stays at 1 (no overlap)
• Uses threading events and mocks to simulate concurrent generate calls

tests/test_simple_engine_cancel_serialization.py


2. vllm_mlx/engine/simple.py 🐞 Bug fix +103/-93

Implement cancellation-safe serialized blocking operations

• Add new _run_blocking_serialized() method that wraps blocking calls with cancellation safety
• Method acquires _generation_lock, runs blocking operation in thread, and waits for completion on cancellation
• Refactor generate() to use _run_blocking_serialized() instead of direct asyncio.to_thread()
• Refactor chat() to use _run_blocking_serialized() for both MLLM and LLM paths
• Refactor stream_chat() MLLM path to use _run_blocking_serialized() and move lock outside async context
• Refactor stream_generate() to use _run_blocking_serialized() for all worker calls

vllm_mlx/engine/simple.py



@qodo-code-review

qodo-code-review Bot commented Mar 24, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0) 📎 Requirement gaps (0) 📐 Spec deviations (0)



Action required

1. MTP stream deadlock 🐞 Bug ⛯ Reliability
Description
_stream_generate_text() holds _generation_lock and now awaits _run_blocking_serialized(),
which tries to acquire the same lock again, deadlocking MLLM+MTP text-only streaming. Text-only MLLM
requests routed to this path will hang indefinitely.
Code

vllm_mlx/engine/simple.py[1199]

+            all_resps = await self._run_blocking_serialized(_run_all)
Evidence
_run_blocking_serialized() acquires _generation_lock unconditionally. _stream_generate_text() wraps its generation body in `async with self._generation_lock:` and later calls `all_resps = await self._run_blocking_serialized(_run_all)` within that block, causing nested acquisition of the same asyncio.Lock and a deadlock.

vllm_mlx/engine/simple.py[229-246]
vllm_mlx/engine/simple.py[1012-1014]
vllm_mlx/engine/simple.py[1194-1201]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`_stream_generate_text()` currently acquires `self._generation_lock` and then calls `_run_blocking_serialized()`, which also acquires `self._generation_lock`. This deadlocks.

### Issue Context
This path is used for MLLM+MTP routing (text-only requests). The PR moved lock acquisition into `_run_blocking_serialized()`, so call sites that already hold the lock must be adjusted.

### Fix Focus Areas
- vllm_mlx/engine/simple.py[229-246]
- vllm_mlx/engine/simple.py[1012-1014]
- vllm_mlx/engine/simple.py[1194-1201]

### Implementation direction
Refactor `_stream_generate_text()` so `_run_all` (and thus MLX/Metal calls) run under `_run_blocking_serialized()` without an additional surrounding `async with self._generation_lock:` block; or add a dedicated cancellation-safe helper that does **not** reacquire the lock when the caller already holds it.
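One possible shape for that second option: a hypothetical unlocked twin of the helper for call sites already inside `async with self._generation_lock:` (the name and the shield usage are illustrative, not code from this PR):

```python
    async def _run_blocking_unlocked(self, fn, *args):
        """Hypothetical twin of _run_blocking_serialized for callers that
        already hold self._generation_lock: same drain-on-cancel behavior,
        no lock acquisition."""
        task = asyncio.create_task(asyncio.to_thread(fn, *args))
        try:
            return await asyncio.shield(task)
        except asyncio.CancelledError:
            try:
                await task  # wait out the worker thread before propagating
            except BaseException:
                pass
            raise
```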




Remediation recommended

2. Flaky cancellation test 🐞 Bug ⛯ Reliability
Description
The new regression test doesn’t assert that first_started.wait() succeeded, so it can proceed even
if the first worker never started, weakening the test and making failures/pass conditions
misleading. This can cause false positives or intermittent behavior under scheduler variance.
Code

tests/test_simple_engine_cancel_serialization.py[R59-63]

+            await asyncio.to_thread(first_started.wait, 1.0)
+
+            task1.cancel()
+            task2 = asyncio.create_task(engine.generate(prompt="second", max_tokens=8))
+
Evidence
threading.Event.wait() returns a boolean indicating whether it was set before timeout, but the
test ignores it and immediately cancels task1 and starts task2. If the first worker didn’t start
within the timeout, the test is no longer validating the intended race.

tests/test_simple_engine_cancel_serialization.py[58-63]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The test should fail fast if the first worker never started; otherwise it may test a different scenario than intended.

### Issue Context
`threading.Event.wait(timeout)` returns `True/False`. The test currently ignores this return value.

### Fix Focus Areas
- tests/test_simple_engine_cancel_serialization.py[58-63]

### Implementation direction
Capture the boolean result and assert it is `True`, e.g.:
- `started = await asyncio.to_thread(first_started.wait, 1.0)`
- `self.assertTrue(started, "first worker did not start in time")`
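Folding that assertion into the test's overall shape, a condensed, self-contained sketch (the stand-in engine and all names and timeouts are illustrative, not the contents of the actual test file):

```python
import asyncio
import threading
import unittest


class _FakeEngine:
    """Stand-in exposing only the cancellation-safe helper under test."""

    def __init__(self) -> None:
        self._generation_lock = asyncio.Lock()

    async def _run_blocking_serialized(self, fn, *args):
        async with self._generation_lock:
            task = asyncio.create_task(asyncio.to_thread(fn, *args))
            try:
                return await asyncio.shield(task)
            except asyncio.CancelledError:
                try:
                    await task  # drain the worker thread before releasing
                except BaseException:
                    pass
                raise


class SimpleEngineCancelSerializationTests(unittest.IsolatedAsyncioTestCase):
    async def test_cancellation_does_not_allow_overlap(self):
        engine = _FakeEngine()
        active, max_active = 0, 0
        guard = threading.Lock()
        first_started, release_first = threading.Event(), threading.Event()

        def worker(is_first):
            nonlocal active, max_active
            with guard:
                active += 1
                max_active = max(max_active, active)
            if is_first:
                first_started.set()
                release_first.wait(1.0)  # hold the fake "MLX call" open
            with guard:
                active -= 1

        task1 = asyncio.create_task(engine._run_blocking_serialized(worker, True))
        started = await asyncio.to_thread(first_started.wait, 1.0)
        self.assertTrue(started, "first worker did not start in time")

        task1.cancel()
        task2 = asyncio.create_task(engine._run_blocking_serialized(worker, False))
        await asyncio.sleep(0.05)  # give task2 a chance to race the drain
        release_first.set()

        with self.assertRaises(asyncio.CancelledError):
            await task1
        await task2
        self.assertEqual(max_active, 1, "blocking workers overlapped")
```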



3. Cancelled worker errors hidden 🐞 Bug ✧ Quality
Description
When the request is cancelled, _run_blocking_serialized() waits for the worker thread but
suppresses any worker exception without logging, making MLX/Metal failures during cancellation hard
to diagnose. This reduces observability of precisely the failure mode the PR is trying to mitigate.
Code

vllm_mlx/engine/simple.py[R240-245]

+            except asyncio.CancelledError:
+                try:
+                    await task
+                except Exception:
+                    pass
+                raise
Evidence
In the cancellation handler, exceptions from the worker task are caught with except Exception: and
dropped (pass) without logging, so a worker crash during cancellation becomes invisible in logs
while the caller only sees CancelledError.

vllm_mlx/engine/simple.py[229-245]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
If the worker raises while the outer task is cancelled, the exception is swallowed, harming debuggability.

### Issue Context
The code correctly waits for the worker to finish to keep MLX/Metal serialized, but should still emit diagnostics if the worker failed.

### Fix Focus Areas
- vllm_mlx/engine/simple.py[229-245]

### Implementation direction
In the `except asyncio.CancelledError:` block, change the bare `except Exception: pass` into a log statement (e.g., `logger.warning(..., exc_info=True)`) before re-raising the cancellation.
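In the diff style used above, one way that could look (assuming a module-level logger already exists in simple.py; illustrative, not the merged code):

+            except asyncio.CancelledError:
+                try:
+                    await task
+                except Exception:
+                    # keep the drain, but surface worker failures
+                    logger.warning(
+                        "worker failed during cancellation drain",
+                        exc_info=True,
+                    )
+                raise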




Comment thread on vllm_mlx/engine/simple.py

             cleanup_rope(model)

-            all_resps = await asyncio.to_thread(_run_all)
+            all_resps = await self._run_blocking_serialized(_run_all)

Action required

1. MTP stream deadlock 🐞 Bug ⛯ Reliability

_stream_generate_text() holds _generation_lock and now awaits _run_blocking_serialized(),
which tries to acquire the same lock again, deadlocking MLLM+MTP text-only streaming. Text-only MLLM
requests routed to this path will hang indefinitely.
Agent Prompt
### Issue description
`_stream_generate_text()` currently acquires `self._generation_lock` and then calls `_run_blocking_serialized()`, which also acquires `self._generation_lock`. This deadlocks.

### Issue Context
This path is used for MLLM+MTP routing (text-only requests). The PR moved lock acquisition into `_run_blocking_serialized()`, so call sites that already hold the lock must be adjusted.

### Fix Focus Areas
- vllm_mlx/engine/simple.py[229-246]
- vllm_mlx/engine/simple.py[1012-1014]
- vllm_mlx/engine/simple.py[1194-1201]

### Implementation direction
Refactor `_stream_generate_text()` so `_run_all` (and thus MLX/Metal calls) run under `_run_blocking_serialized()` without an additional surrounding `async with self._generation_lock:` block; or add a dedicated cancellation-safe helper that does **not** reacquire the lock when the caller already holds it.


@krystophny
Collaborator Author

krystophny commented Mar 26, 2026

Benchmark: Simple Engine vs Continuous Batching

Qwen3.5-2B on Apple Silicon (MLX), 128 max_tokens, streaming, 5 iterations per configuration.
Values are mean +/- stdev across iterations.

Aggregate throughput (tok/s) — total completion tokens / wall clock

| Concurrency | Simple Engine | Continuous Batching |
|---|---|---|
| 1 | 120 +/- 44 | 142 +/- 57 |
| 2 | 96 +/- 7 | 144 +/- 11 |
| 4 | 95 +/- 1 | 173 +/- 6 |
| 8 | 95 +/- 1 | 225 +/- 3 |

Wall clock (seconds) — time from first request to last response

| Concurrency | Simple Engine | Continuous Batching |
|---|---|---|
| 1 | 1.21 +/- 0.52 | 1.08 +/- 0.58 |
| 2 | 2.67 +/- 0.19 | 1.79 +/- 0.13 |
| 4 | 5.38 +/- 0.05 | 2.23 +/- 0.08 |
| 8 | 10.79 +/- 0.06 | 3.89 +/- 0.05 |

Per-request throughput (tok/s) — individual request completion tokens / elapsed

| Concurrency | Simple Engine | Continuous Batching |
|---|---|---|
| 1 | 120 +/- 44 | 142 +/- 57 |
| 2 | 68 +/- 23 | 72 +/- 5 |
| 4 | 46 +/- 24 | 47 +/- 19 |
| 8 | 30 +/- 23 | 30 +/- 7 |

TTFT — time to first content token

Measured from request send to first non-empty SSE delta. Includes prompt processing.

| Concurrency | Metric | Simple Engine | Continuous Batching |
|---|---|---|---|
| 1 | min | 1096 +/- 458 ms | 1075 +/- 579 ms |
| 1 | max | 1096 +/- 458 ms | 1075 +/- 579 ms |
| 2 | min | 1196 +/- 220 ms | 1789 +/- 134 ms |
| 2 | max | 2668 +/- 190 ms | 1789 +/- 134 ms |
| 4 | min | 1228 +/- 38 ms | 129 +/- 27 ms |
| 4 | max | 5379 +/- 48 ms | 2231 +/- 82 ms |
| 8 | min | 1392 +/- 193 ms | 181 +/- 17 ms |
| 8 | max | 10787 +/- 57 ms | 3889 +/- 50 ms |

Analysis

Throughput scaling:

  • Simple engine aggregate throughput is flat at ~95 tok/s regardless of concurrency (2-8). Requests serialize behind the generation lock — adding concurrent requests only adds queue wait.
  • Continuous batching scales: 142 -> 144 -> 173 -> 225 tok/s from 1 to 8 concurrent requests. At 8 concurrent, batching delivers 2.4x higher aggregate throughput (225 vs 95 tok/s).

Wall clock:

  • Simple engine wall clock scales linearly: 8 concurrent requests take 10.8s (= 8 x ~1.35s per request, serialized).
  • Continuous batching at 8 concurrent: 3.9s wall clock — 2.8x faster than simple engine.

Per-request throughput:

  • Nearly identical between engines at every concurrency level. Each individual request gets the same bandwidth — the difference is whether others can run concurrently.

TTFT behavior:

  • Simple engine: min TTFT stays ~1.2-1.4s regardless of concurrency (first-in-queue gets served immediately). Max TTFT grows linearly: 2.7s at conc=2, 5.4s at conc=4, 10.8s at conc=8 (last request waits for all preceding ones).
  • Continuous batching: at higher concurrency, min TTFT drops dramatically (129ms at conc=4, 181ms at conc=8) because short-response requests (e.g., "capital of France" -> 2 tokens) finish almost immediately within the batch. Max TTFT is the longest-generating request in the batch (2.2s at conc=4, 3.9s at conc=8).
  • Key difference: simple engine TTFT worst-case is sum of all preceding requests. Batching TTFT worst-case is just the longest individual request in the batch.

Bottom line: the simple engine serialization from this PR is correct behavior for the non-batched code path. Users needing concurrent request handling should use --continuous-batching.

krystophny added a commit that referenced this pull request Apr 14, 2026

* fix: keep simple engine serialized across cancellation (#8)

* fix: avoid nested simple engine generation locks

* fix: catch BaseException in cancellation handler, fix async test markers

_run_blocking_serialized catches CancelledError (a BaseException subclass)
from the outer scope, but the inner try/except used Exception which would
let a second CancelledError during await task escape unhandled. Changed to
BaseException to suppress any exception from the draining await.

Also fix test_simple_engine.py to use pytest.mark.anyio instead of
pytest.mark.asyncio (pytest-asyncio is not configured), and add the
anyio_backend fixture to conftest.py restricting to asyncio only since
trio is not installed.

* fix: preserve prompt token accounting after upstream refresh

* fix: restore specprefill fallback helper scope
